Automatic calculation of a polygon area #1534

Merged
merged 18 commits into from
Jul 20, 2017

Member

@jnordling jnordling commented May 24, 2017

Proposed changes in this pull request

Adds automatic calculation of a polygon's area (for project locations). This is done by adding a geometry_details field to the SpatialUnit model; the area is calculated during creation or editing of a SpatialUnit. The area is calculated and stored in square meters, and the view handles the conversion of units. There are two filter functions that can be used in the view to convert. On the Location Detail view, the area is displayed if the geometry is a Polygon. The Project Dashboard shows the sum of the total area for that project. The geometry details have also been added to the data export for the project.

The issue and requirements for this PR are in #811.

When should this PR be merged

Soon, preferably

Risks

None I can think of.

Follow-up actions

I connected with @clash99 about the UI. She will be adjusting the style of how the areas are represented in the new project dashboard view in sprints to come.

Checklist (for reviewing)

General

  • Is this PR explained thoroughly? All code changes must be accounted for in the PR description.
  • Is the PR labeled correctly? It should have the migration label if a new migration is added.
  • Is the risk level assessment sufficient? The risks section should contain all risks that might be introduced with the PR and which actions we need to take to mitigate these risks. Possible risks are database migrations, new libraries that need to be installed or changes to deployment scripts.

Functionality

  • Are all requirements met? Compare implemented functionality with the requirements specification.
  • Does the UI work as expected? There should be no Javascript errors in the console; all resources should load. There should be no unexpected errors. Deliberately try to break the feature to find out if there are corner cases that are not handled.

Code

  • Do you fully understand the introduced changes to the code? If not ask for clarification, it might uncover ways to solve a problem in a more elegant and efficient way.
  • Does the PR introduce any inefficient database requests? Use the debug server to check for duplicate requests.
  • Are all necessary strings marked for translation? All strings that are exposed to users via the UI must be marked for translation.

Tests

  • Are there sufficient test cases? Ensure that all components are tested individually; models, forms, and serializers should be tested in isolation even if a test for a view covers these components.
  • If this is a bug fix, are tests for the issue in place? There must be a test case for the bug to ensure the issue won’t regress. Make sure that the tests break without the new code to fix the issue.
  • If this is a new feature or a significant change to an existing feature, has the manual testing spreadsheet been updated with instructions for manual testing?

Security

  • Confirm this PR doesn't commit any keys, passwords, tokens, usernames, or other secrets.
  • Are all UI and API inputs run through forms or serializers?
  • Are all external inputs validated and sanitized appropriately?
  • Does all branching logic have a default case?
  • Does this solution handle outliers and edge cases gracefully?
  • Are all external communications secured and restricted to SSL?

Documentation

  • Are changes to the UI documented in the platform docs? If this PR introduces new platform site functionality or changes existing ones, the changes must be documented in the Cadasta Platform Documentation.
  • Are changes to the API documented in the API docs? If this PR introduces new API functionality or changes existing ones, the changes must be documented in the API docs.
  • Are reusable components documented? If this PR introduces components that are relevant to other developers (for instance a mixin for a view or a generic form) they should be documented in the Wiki.

@jnordling jnordling changed the title Feature/automatic area calculation Automatic calculation of a polygon area May 24, 2017
Member

@oliverroick oliverroick left a comment

Looks mostly ok, some comments:

  • The templates need some CSS love; is @clash99 picking that up?
  • I think there should be a database migration that calculates the areas for all existing spatial units.
  • It's not 100% clear what the geometry_details column is. Could we rename it to area? I'm thinking that in the future, when we have length available as well, there should be two columns in the export, area and length, and we just leave the one empty that doesn't apply.

@@ -0,0 +1,26 @@
# -*- coding: utf-8 -*-
Member

We usually rename migration files to something readable so we know what it's about when we look at it in the migrations table; something like 0003_add_geometry_field.py.

Member Author

I have changed the implementation to be based on a PostGIS query. Since the geometry is stored and PostGIS has a built-in area function, there is no need to store the area in a field anymore.

@receiver(models.signals.pre_save, sender=SpatialUnit)
def define_geometry_details(sender, instance, **kwargs):
    geom = instance.geometry
    from django.contrib.gis.geos.polygon import Polygon
Member

Can you move the import to the top of the file?

Member Author

This has now been removed altogether.

Contributor

@alukach alukach left a comment

This PR has a schema migration to add a geometry_details field and tooling to add that data when a SpatialUnit is created or changed, however this PR doesn't address the SpatialUnit instances that already exist in the database. geometry_details values should be generated for the existing data via a data migration.

Additionally, I see that the area is stored as a string. Is there any advantage to this? Keeping it as a float will keep querying options more open for future needs.

In [1]: su = SpatialUnit.objects.first()

In [2]: su.geometry_details

In [3]: su.save()

In [4]: su.geometry_details
Out[4]: {'area': '413525.04'}

Ultimately, I question whether there is actually a need to store a geometry's area as another field (or a portion of another field). This data can be queried from the DB and will be guaranteed to be up-to-date. Storing data in the DB that is derived from other data in the DB feels like an uphill battle: keeping the values in sync always feels like a chore and is best avoided if possible.

It seems like we have two needs: 1) get area for single SpatialUnit, 2) get area for many SpatialUnit instances.

Getting the area for a single unit could be a property on the model returning self.geometry.transform(3857, clone=True).area and getting the area from multiple instances can be done via aggregation/annotation:

from django.contrib.gis.db.models.functions import Area, Transform
from django.db.models import Sum

# Get all SpatialUnits with a geometry
qs = SpatialUnit.objects.exclude(geometry=None)

# Add a field to each SpatialUnit with the area of its transformed geometry
qs = qs.annotate(area=Area(Transform('geometry', 3857)))

# Get sum of all areas
area = qs.aggregate(Sum('area'))
print(area['area__sum'].sq_m)

PostgreSQL and PostGIS are optimized for these types of operations and will likely be faster than doing them in Python.

@jnordling
Member Author

Based on @alukach's feedback, I think we can remove geometry_details from the model and simply handle this in the context with PostgreSQL and PostGIS, which will eliminate @oliverroick's concern about the naming of geometry_details and area. Also, the back-processing will not need to take place, since nothing will change on the SpatialUnit once the migration is removed.

@alukach & @oliverroick I will make the appropriate changes. Thanks for your feedback!

@clash99
Contributor

clash99 commented May 26, 2017

@oliverroick - yes, let me know once this is approved and I will need to update the project dashboard.

Contributor

@alukach alukach left a comment

I think this looks really good. It looks like there's one possible bug in the code.

Style thought: What you have (calling location.geometry.transform(3857, clone=True).area in a few places) totally works, however it may make your life easier to just add a @property for area on the SpatialUnit model. That way you would only need to write self.geometry.transform(3857, clone=True).area once and handle situations where location.geometry == None in one location.
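A minimal sketch of the suggested property, written against plain Python stand-ins rather than the real Django model so that it runs on its own. `SpatialUnitSketch` and `FakeGeometry` are hypothetical names; only the `geometry.transform(3857, clone=True).area` call is taken from the comment above.

```python
class FakeGeometry:
    """Hypothetical stand-in for a GEOS geometry; not the real GeoDjango class."""

    def __init__(self, area):
        self.area = area

    def transform(self, srid, clone=False):
        # The real method reprojects coordinates. This stub only mimics the
        # calling convention: a new object is returned when clone=True.
        if clone:
            return FakeGeometry(self.area)
        return None


class SpatialUnitSketch:
    """Hypothetical stand-in for the SpatialUnit model."""

    def __init__(self, geometry=None):
        self.geometry = geometry

    @property
    def area(self):
        # The "location without a geometry" case is handled in one place.
        if self.geometry is None:
            return None
        return self.geometry.transform(3857, clone=True).area
```

With this in place, views and templates could read `location.area` and would only need to deal with a possible `None`.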

@@ -116,6 +116,7 @@ def get_context_data(self, *args, **kwargs):
pass

location = context['location']
context['area'] = location.geometry.transform(3857, clone=True).area
Contributor

It's possible that a location could not have a geometry value. I believe this will throw some errors in that situation.

Member Author

I added a property named area to SpatialUnit to handle the case where location.geometry == None. As you mentioned, the property also allowed the removal of the redundant geometry.transform(3857, clone=True).area calls.

else:
ac = area * 0.00024711
formated_area = format(ac, '.2f') + ' ac'
return formated_area
Contributor

@alukach alukach May 26, 2017

Style thought: I believe that these filters can be reduced in size a bit. There's no need to declare an empty string on the first line. Rather than formated_area = format(..., you can simply return the values from those lines.

Member Author

I updated this to simply return the values and removed the unnecessary formated_area variable.
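For illustration, the simplified filters might look roughly like the sketch below. The thread does not show the full filters, so the function names, the 1000 m² switch-over threshold, and every unit suffix other than ' ac' are assumptions; only the acre factor 0.00024711 comes from the snippet under review.

```python
SQ_M_TO_ACRES = 0.00024711  # factor used in the snippet under review
SQ_M_TO_SQ_FT = 10.7639     # square feet per square meter
SQ_M_TO_HA = 0.0001         # hectares per square meter
SMALL_AREA_SQ_M = 1000      # hypothetical cut-off between small and large units


def format_area_metric(area):
    """Format an area given in square meters as m2 or ha."""
    if area < SMALL_AREA_SQ_M:
        return format(area, '.2f') + ' m2'
    return format(area * SQ_M_TO_HA, '.2f') + ' ha'


def format_area_imperial(area):
    """Format an area given in square meters as ft2 or ac."""
    if area < SMALL_AREA_SQ_M:
        return format(area * SQ_M_TO_SQ_FT, '.2f') + ' ft2'
    return format(area * SQ_M_TO_ACRES, '.2f') + ' ac'
```

Each branch returns directly, so no placeholder string or intermediate formated_area variable is needed.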

@jnordling
Member Author

jnordling commented May 28, 2017

@alukach @oliverroick Thanks for all your feedback! I have made the requested changes. Let me know if you see anything else that needs to be fixed.

Member

@oliverroick oliverroick left a comment

This looks good; thanks, Jon!

@oliverroick
Member

@clash99 So the PR is good to go. I think we should handle the changes to the dashboard this way:

  1. We remove the dashboard changes from this PR because they don't fit very well into the UI.
  2. We merge this PR.
  3. We merge the dashboard updates PR (Rehaul of project dashboard, project overview, organization dashboard, and organization overview. #1537).
  4. Finally, we make a new PR that includes area stats into the dashboards.

That way, this PR and #1537 are not dependent on each other and can be merged whenever they are ready.

@amplifi
Contributor

amplifi commented May 31, 2017

We need to hold off on merging this PR; the solution as-is won't scale well and will introduce a hit to performance/page load times.

Location geometries infrequently (if ever) change, so their area calculations should absolutely be stored in the database table alongside the geometry. Re-calculating the area every time the value is accessed is inefficient and unnecessary. It's trivial to ensure that a location's area stays updated in the event of a change to its geometry -- use a db trigger to detect changes to the geometry and automatically recalculate and store its area. This lets us leverage the efficiencies of calculation in PostGIS (rather than Python), eliminates any issues around keeping the area value up to date, and makes accessing the area value a simple read op (which means no negative performance impact).

This is a beneficial optimization in the context of displaying a single location's area on the location detail page. However, it becomes essential in the context of our other intended use for area: displaying the cumulative area of all locations in a project on the project dashboard. In this PR's implementation, that means every single time a user views their project dashboard page, we will be querying the database for each location, calculating the area of each location, adding all those areas, and displaying the total to the user. This value doesn't get stored or cached anywhere, so every dashboard page load and every refresh for every user will be repeating those steps. @oliverroick mentioned that testing this in his local dev VM with 60k locations took ~600ms. That's with a local database, local platform server, no latencies, no other traffic, one user at a time, no other database transactions, etc. In a live production environment, this implementation will negatively and unnecessarily impact performance. Our page load times are already cumbersome on slow connections, and there's no reason to calculate these values on access.

To implement this properly, we should: calculate and re-calculate the area via trigger whenever a location's geometry is created or updated, store the area value in the database, calculate and re-calculate a project's cumulative area via trigger whenever any of its location geometries are created/updated, and store that area value in the database. Every access of location or project area is a read.
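The trigger approach described above can be sketched in a self-contained way. This is only an illustration under loud assumptions: SQLite (from the Python standard library) stands in for PostgreSQL/PostGIS, `width * height` stands in for the real area calculation on a geometry, and the table, column, and trigger names are hypothetical.

```python
import sqlite3

# SQLite is a stand-in for PostgreSQL/PostGIS so the sketch runs anywhere.
# width * height is a placeholder for the area of a real geometry.
SCHEMA = """
CREATE TABLE project (
    id INTEGER PRIMARY KEY,
    total_area REAL NOT NULL DEFAULT 0
);

CREATE TABLE location (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(id),
    width REAL NOT NULL,
    height REAL NOT NULL,
    area REAL
);

-- Store the area and refresh the project total when a location is created.
CREATE TRIGGER location_area_insert AFTER INSERT ON location
BEGIN
    UPDATE location SET area = NEW.width * NEW.height WHERE id = NEW.id;
    UPDATE project
       SET total_area = (SELECT COALESCE(SUM(area), 0)
                           FROM location
                          WHERE project_id = NEW.project_id)
     WHERE id = NEW.project_id;
END;

-- Do the same whenever the "geometry" columns change.
CREATE TRIGGER location_area_update AFTER UPDATE OF width, height ON location
BEGIN
    UPDATE location SET area = NEW.width * NEW.height WHERE id = NEW.id;
    UPDATE project
       SET total_area = (SELECT COALESCE(SUM(area), 0)
                           FROM location
                          WHERE project_id = NEW.project_id)
     WHERE id = NEW.project_id;
END;
"""


def make_db():
    """Create an in-memory database with the tables and triggers above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def project_total(conn, project_id):
    """Read the stored total; nothing is recalculated on access."""
    row = conn.execute(
        "SELECT total_area FROM project WHERE id = ?", (project_id,)
    ).fetchone()
    return row[0]
```

The sketch recomputes the whole project sum on each write for simplicity and ignores deletes and locations moving between projects; a production trigger would need to cover those cases as well.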

Contributor

@amplifi amplifi left a comment

Comments in PR body

@jnordling
Member Author

@amplifi Storing the area in the database was how I originally implemented this. Thank goodness for version control. I'll re-implement it with the area stored in the database.

@wonderchook
Contributor

@amplifi, thoughts on storing the area in one unit of measurement? It was originally implemented so that the area was stored in four different units. I think the conversion calculation would be less of an issue. Thoughts?

cc @jnordling

@alukach
Contributor

alukach commented May 31, 2017

@jnordling I think I've come around to agreeing with @amplifi. I was going to make the case that we should avoid storing redundant data in the DB as it felt like a premature optimization, that the queries aren't really that slow, and that we should instead lean on caching views (which would have greater performance gains than caching data in the DB, as template rendering is likely our biggest cost). However, testing the area calculation for 800k locations took about 10s, which is obviously too long to run even intermittently.

DB triggers are a great way to ensure that an area is calculated when a location is uploaded or changed (I'm not familiar enough to say how easy or difficult it would be to use triggers for summing the areas of all locations in a project). However, I think the big downside is that managing triggers isn't particularly well supported in Django (we could lean on data migrations for this), and it divides logic between the database and the Django/Python code, which makes the system a bit more complex. Signals (as were used in the initial implementation) are a good way to avoid this but can easily miss changes (such as when using .update() or .bulk_create() on a model or queryset).

So yeah, despite the complexities, I think DB triggers are the way to go. A DB trigger that would only calculate area for the few polygons that were saved and update the project with a sum does seem like the best performance/effort ratio. Remember that DB triggers do block the transaction; however, I think this is acceptable given that, as @amplifi said, reads are far more common than writes.

Regarding @wonderchook's comment, I agree that storing them in a single unit (I recommend sq meters) and converting in Python seems like the best way to keep things simple.

@jnordling
Member Author

@alukach My understanding of the signal pre_save is that it works for update and create or any save for that matter. Is this not right? We are already doing this for the check_extent when SpatialUnit are updated/created or saved.

For bulk_create: what about implementing a bulk_create override for the SpatialUnit model, since signals do not fire in that case? I also don't see any bulk_create calls for SpatialUnit (but I understand wanting to plan for it). I think triggers sound nice in theory but, as you mentioned, they introduce a new level of complexity.

Anyway, just thoughts. I'll start researching DB triggers for this.

@alukach
Contributor

alukach commented May 31, 2017

My understanding of the signal pre_save is that it works for update and create or any save for that matter. Is this not right?

Unfortunately not:

Be aware that the update() method is converted directly to an SQL statement. It is a bulk operation for direct updates. It doesn’t run any save() methods on your models, or emit the pre_save or post_save signals (which are a consequence of calling save()), or honor the auto_now field option. If you want to save every item in a QuerySet and make sure that the save() method is called on each instance, you don’t need any special function to handle that.

Admittedly, I might be acting overly picky about this. Perhaps signals are good enough for now (not sure how often anyone would actually use .update() for the geometry field or .bulk_create() for SpatialUnit instances.) I'm okay with sticking with signals for the sake of simplicity and getting this done if others are as well.

@jnordling
Member Author

jnordling commented May 31, 2017

@alukach I'm referencing http://eflorenzano.com/blog/2008/11/04/database-triggers-arent-evil-and-they-actually-kin/, and it seems not terrible. I'm going to give it a go and see what happens... :)

@oliverroick
Member

Hey @jnordling, is there any progress on this? Do you need anything from our end?

@jnordling
Member Author

Hey @oliverroick. I think I can finish this up in the coming days, as long as Django signals are OK to use?

@oliverroick
Member

Signals should be ok, we're using them elsewhere too.

@laura-barluzzi laura-barluzzi mentioned this pull request Jun 26, 2017
20 tasks
@jnordling
Member Author

I updated this PR to use signals and removed the UI component, so there will be no conflict with Add user measurement system #1610 or the new dashboard UI. I also added the data migration. I think this PR needs to be tagged for migration? Anyway, have a look, @oliverroick, and let me know if it needs anything else.

if item.area:
    value = item.area
else:
    value = None
Contributor

@alukach alukach Jul 7, 2017

I'm a bit confused by this, what's the purpose of this if item.area check? Additionally, why aren't we using value.area?

Member Author

I removed this. It was unnecessary.

geom = sp.geometry
if geom and isinstance(geom, Polygon) and geom.valid:
sp.area = geom.transform(3857, clone=True).area
sp.save()
Contributor

I feel like we could do this more performantly with F expressions, pushing the calculation to the DB:

from django.contrib.gis.db.models.functions import Area, Transform
from django.db.models import F

SpatialUnit.objects.exclude(geometry=None).update(area=Area(Transform('geometry', 3857)))

My main concern is that SpatialUnit.objects.all() does not chunk the results at all by default, so this will request every field of every row from the SpatialUnit table at once which may bring up memory concerns on the server in production.

However, given my track record on this PR, you may want to take that with a grain of salt 😉 . I'll let @amplifi comment on whether she thinks doing the calculation for all SpatialUnit instances with geometries in a single transaction will be too much for the DB (I'm not familiar with the # of rows in the prod DB or the DB server's capabilities). If this is deemed to be an issue, I'd recommend still using .update() with an F expression but on batches of the queryset (you can slice the queryset to generate these batches).
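A hedged sketch of the batching idea. As far as I know, Django refuses to call .update() on a queryset once it has been sliced, so this sketch batches by primary key instead. The `chunked` helper and the batch size of 1000 are hypothetical; the commented usage reuses the `Area(Transform('geometry', 3857))` expression from the comment above and is not run here because it needs a configured Django project.

```python
def chunked(items, size):
    """Yield consecutive lists of at most `size` items from a list."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical use inside the data migration (requires Django, not run here):
#
#   ids = list(SpatialUnit.objects.exclude(geometry=None)
#              .values_list('pk', flat=True))
#   for chunk in chunked(ids, 1000):
#       SpatialUnit.objects.filter(pk__in=chunk).update(
#           area=Area(Transform('geometry', 3857)))
```

Only the primary keys are loaded into Python, and each batch is a separate UPDATE statement. Whether the batches share one transaction depends on how the migration is configured.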

Member Author

@jnordling jnordling Jul 7, 2017

I updated the migration with your suggestion to use F expressions. I also added a filter so that it only runs on Polygons.

    SpatialUnit.objects.exclude(geometry=None).extra(
        where=["geometrytype(geometry) LIKE 'POLYGON'"]).update(
        area=Area(Transform('geometry', 3857)))

Contributor

@alukach alukach left a comment

Looks good to me.

jnordling and others added 18 commits July 20, 2017 11:14
…al to calculate geometry details, m2,ft2,ha,ac
…for both project dashboard and location details
…postgis query and project sum, updated tests and linting
…t code, made code style changes and added updated test based on new property
@oliverroick
Member

oliverroick commented Jul 20, 2017

@amplifi: all rebased; ready to go.

Follow-up task to include areas into project dashboard and location detail pages created (#1666).

@amplifi amplifi merged commit 76495ef into master Jul 20, 2017
@amplifi amplifi deleted the feature/automatic_area_calculation branch July 20, 2017 12:51
@seav
Contributor

seav commented Aug 1, 2017

Hi! Based on my investigation, I think the method used to compute the area of polygons is very wrong. See the bug I filed at #1689.
